So, welcome to today's HPC Cafe.
We're going to talk about handling many small files and how to manage AI data sets.
First of all, I think what we would all like is to be the sole user of the HPC clusters so that we get the full performance.
It feels like being a super scientist with our own computer, and then we find out that it's actually really slow, like being stuck in a traffic jam.
The reason for that is that we're not alone on the cluster; there are around a thousand other people using the clusters apart from ourselves.
So some jobs can feel really slow.
We actually have a success story from support, where a user came to us who observed fluctuating job runtimes and job cancellations.
The runtime of his job was around 28 hours on an NVIDIA V100, which does not really fit into our 24-hour queues, does it?
We analyzed the underlying problem, and it was basically that the user had 120 gigabytes of data in over 340,000 files, which were not even copied but accessed directly via the file server.
And the performance was bad because you're not alone on the cluster; there are a lot of other people.
The solution to this problem is data staging: we combined the 340,000 files into a single archive file, which in the end was only 10 gigabytes in size, and then extracted this file directly to the node-local storage.
It took 13 minutes to transfer the data, and in the end the runtime of the job was 12 hours on the NVIDIA V100, and the user was really happy about it.
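To make this concrete, here is a minimal sketch of what such data staging could look like in a SLURM job script. The archive name, the path under $WORK, the train.py call, the GPU request, and the assumption that $TMPDIR points to the node-local SSD are all placeholders from my side, not from the talk; adapt them to your own setup.

    #!/bin/bash -l
    #SBATCH --gres=gpu:v100:1      # hypothetical request for one V100
    #SBATCH --time=24:00:00

    # Stage the data once: read the single archive from the central file system
    # and extract it directly to the node-local SSD (assumed to be $TMPDIR).
    tar -xf "$WORK/dataset.tar" -C "$TMPDIR"

    # All further reads then hit the fast local storage instead of the file server.
    python train.py --data "$TMPDIR/dataset"

This is exactly the pattern from the success story: one large sequential transfer instead of hundreds of thousands of small-file accesses over the network.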
So what can we learn here?
We gain runtime by providing the data locally on the compute node, which is really nice, and we in support like it as well.
So we ran some benchmarks on different data sets, and I'm going to quickly go through them.
We have a medium-sized data set with around 480,000 files, which is about 450 gigabytes of data; a small data set with 90,000 files and 3 gigabytes of data; and a big data set with 140 files and 1.5 terabytes of data.
The interesting thing is that when we archive the files, we can go from 480,000 files down to 21 files, which is far fewer files to copy.
Interestingly, we can also decrease the data size if we use compression.
I will go into detail a bit later.
And even for big data sets, we can decrease the size by around two thirds.
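As a rough illustration of how such an archive could be created and later unpacked, here is a sketch; the directory and archive names are placeholders, and the choice of zstd compression is my assumption (it needs a GNU tar built with zstd support), not a result from the benchmark itself.

    # Pack a directory of many small files into one compressed archive
    # (done once, ahead of time, e.g. on the work file system).
    tar --zstd -cf dataset.tar.zst dataset/

    # Inside a job, unpack it in a single sequential read to node-local storage.
    tar --zstd -xf dataset.tar.zst -C "$TMPDIR"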
What we benchmarked were the different file systems we have available: the node-local storage, which is an SSD; anvme, which is a Lustre-based file system accessible via InfiniBand with high-performance I/O capabilities; and then what you probably all know as the work file system, the central NFS server used for short- and mid-term storage and as the general-purpose file system.
These all have different latencies and bandwidths.
Why would I want to read data over the network, via file systems that are comparatively slow, when I can access the data really fast on the node-local storage?
It's pretty obvious to me.
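If you want to see what the node-local storage looks like from inside a job, a quick check along these lines can help; the environment variable names ($TMPDIR for the node-local scratch, $WORK for the work file system) are common conventions and may differ on your cluster, so treat this as an assumption.

    # Inside an (interactive) job: where does the node-local scratch live, and how big is it?
    echo "$TMPDIR"
    df -h "$TMPDIR"

    # For comparison, the central NFS-based work file system.
    df -h "$WORK"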
Duration: 00:31:48 min
Recording date: 2025-02-11
Language: en-US
Speaker: Dr. Anna Kahler, NHR@FAU
Slides: https://hpc.fau.de/files/2025/02/HPC-Cafe-Small-Files-AI-DataSets-Feb-11-2025.pdf
Abstract:
We invite you to join us for a discussion on data handling, including the possibility of NHR@FAU providing access to popular data sets. As part of this discussion, we will present an overview of the various file systems available for data storage at NHR@FAU, covering key topics such as data archive formats, data copying, archiving, compressing, and unpacking, as well as recommendations for the most effective programs to use. Additionally, we will share best practices for efficient data storage and access in your SLURM scripts.
By taking a few simple steps, many common data handling issues can be resolved, which is crucial given that NHR@FAU supports over 1,000 users and inefficient data usage can impact not only individual workflows but also those of colleagues. Despite the importance of this issue, we continue to observe inefficient data handling practices and believe it is essential to revisit this topic time and again.
Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/